Instrument the realm-server search path with a request→response timeline (CS-11363)#5098
Instrument the realm-server search path with a request→response timeline (CS-11363)#5098habdelra wants to merge 2 commits into
Conversation
…ine (CS-11363) During a from-scratch index, prerendered cards block for tens of seconds on in-render `_search` round-trips while the SQL behind them runs in milliseconds. We could see the client wait (host-side `boxel_index.diagnostics`) but had no server-side view of where those seconds went. This adds that view: - A client-minted `x-boxel-request-id` correlation id, stamped on the prerendered host's `_federated-search` fetch (prerender-gated; live SPA traffic is untouched). Join key only — no new client measurement. - A `realm:search-timing` line keyed by that id with the request→response stage breakdown: parse, resolveRealms, sql, loadLinks, populate (relationship/query-field assembly), per-instance cache read/write + hit/miss, stringify, and result count. Emitted by `handle-search` for the full handler picture, and standalone by `searchRealms` on the host-test path. Gated on a correlation id, so normal traffic allocates nothing. - Request duration + the correlation id on the `realm:requests` log lines (the outermost request→response bound, incl. body read + send). - A `realm:health` sampler: event-loop lag (mean/p99/max) alongside the in-flight `_search` count and heap, logged only during saturation windows — the signal that confirms or refutes single-thread saturation directly. The timing collector is threaded through the shared `SearchOpts` and is a no-op without a correlation id; `RealmIndexQueryEngine` only ever runs in the realm-server, so the host stays unaffected. A host integration test exercises the whole path end-to-end: the in-realm browser mints the id, issues a real `store.search`, and the test asserts the id lands in the server's `realm:search-timing` line (and that live, non-prerender traffic stamps nothing and emits nothing). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 498aab2c5c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
Adds end-to-end, request-correlated timing instrumentation for the realm-server federated search pipeline so slow _search calls can be attributed to specific server-side stages (parse, realm resolution, SQL, link loading/population, cache read/write, stringify) and correlated with request logs and client diagnostics.
Changes:
- Introduces a
RequestTimingscollector and threads it throughsearchRealms→Realm.search→RealmIndexQueryEngineto stamp stage durations + counters and emit a singlerealm:search-timingline keyed by a client-minted correlation id. - Adds a prerender-only
x-boxel-request-idheader minted by the host SPA and sanitized/echoed by the realm-server request middleware, includingdur=in therealm:requestslog for full request→response timing. - Adds a realm-server health sampler that periodically records event-loop lag + in-flight search count during saturation windows, plus integration coverage verifying header stamping and timing-line emission behavior.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| packages/runtime-common/search-utils.ts | Extends SearchOpts with correlation/timings, adds a testable timing-log emitter, and optionally emits a per-request timing line when it “owns” the collector. |
| packages/runtime-common/request-timings.ts | New request-scoped stage timer + counters with compact log formatting. |
| packages/runtime-common/realm.ts | Threads timings through realm search options into the query engine. |
| packages/runtime-common/realm-index-query-engine.ts | Times sql, loadLinks, populate, and per-instance cache read/write; increments cache hit/miss counters. |
| packages/runtime-common/prerender-headers.ts | Defines/sanitizes x-boxel-request-id to prevent log forging and provide a shared header name. |
| packages/runtime-common/index.ts | Exports RequestTimings from runtime-common’s public surface. |
| packages/realm-server/search-inflight.ts | New module-level in-flight _search counter for health sampling correlation. |
| packages/realm-server/middleware/index.ts | Adds req= correlation + dur= to request logs, echoes sanitized request id, and tracks in-flight search lifecycle (finish/close). |
| packages/realm-server/main.ts | Starts/stops the health sampler with the server lifecycle. |
| packages/realm-server/health-sampler.ts | New event-loop lag sampler that logs realm:health only during saturation windows. |
| packages/realm-server/handlers/handle-search.ts | Creates/threads request timing collector (when correlation id present), stamps parse/resolve/stringify, and emits the full timeline line (incl. cache-hit paths). |
| packages/host/tests/integration/search-correlation-id-test.gts | New integration test asserting prerender-only header stamping and server timing-line correlation. |
| packages/host/tests/helpers/realm-server-mock/routes.ts | Threads sanitized request id into searchRealms in the mock so tests observe real timing emission. |
| packages/host/app/services/store.ts | Adds requestIdHeader() to _federated-search requests. |
| packages/host/app/lib/prerender-fetch-headers.ts | Implements prerender-gated per-request correlation id minting with crypto.randomUUID fallback. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ed-search timing - CORS: add the correlation header to the realm-server's `Access-Control-Allow-Headers` allowlist. Without it the cross-origin preflight blocks the prerender's `_federated-search` once the host stamps the header. - Rename `x-boxel-request-id` → `x-boxel-logging-correlation-id` (and the internal field/vars to `loggingCorrelationId`, the log token to `corr=`): the old name read too much like the unrelated `x-boxel-client-request-id`, which is realm-event/write correlation, a different concern. - searchCards in-flight coalescing: exclude the per-request `timings` collector from `searchInFlightKey` so it can't perturb the dedup key, and when a follower awaits an already-running identical search, record that wait as `coalescedWait` on the follower's own collector — otherwise its timing line would show no sql/loadLinks and look misleadingly instant under concurrent search load. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What this does
Adds server-side timing instrumentation to the realm-server's federated-search path, so we can answer where a slow
_searchspends its time — from request receipt to response sent ("hoof to snout") — instead of guessing.Why
During a from-scratch index of a query-field-heavy realm, prerendered cards stall in the
waiting-stabilitystage for tens of seconds — up to the 90s render timeout — waiting on an in-render_searchround-trip (a query-backedlinksToManygetter, e.g. a customer'spoliciesfield). The card's ownqueryLoadsInFlightdiagnostics show the search in flight for ~88s, yetpg_stat_activityshows zero SQL running longer than a second. So the time is not query execution — the realm-server (a single Node process) is slow to deliver the response, and we had no way to see why.The gap: the host-side
boxel_index.diagnosticscolumn already captures the client's view (queryLoadsInFlight/recentQueryLoads— that's how we know it waited 88s), but the server side of that same request was completely unmetered. The whole pipeline — receipt → SQL → the post-SQLloadLinksrelationship/query-field assembly → the per-instance cache round-trips →JSON.stringifyof the wire-format → send — emitted nothing, there was no per-request duration, and nothing linked the client's view to the server's.How it fits together
The search request flows: prerendered host SPA issues
_federated-search→ realm-serverhandle-search→searchRealms→ per-realmRealm.search→RealmIndexQueryEngine.searchCards→loadLinks→ SQL. The instrumentation threads a smallRequestTimingscollector down that existing path (via the sharedSearchOpts, alongside the job-id/priority already threaded there) and emits one line per request.The four pieces:
Correlation id (client → server).
requestIdHeader()mintsx-boxel-request-idon the prerendered host's_federated-searchfetch, gated on the prerender context exactly like the existing job-id / consuming-realm headers — so live SPA traffic stamps nothing. It's a join key only; no new client-side measurement (the client wait is already in the diagnostics column). The realm-server reads it back out and keys its timeline on it.Request→response timeline.
handle-searchowns aRequestTimingscollector (created only when a correlation id is present) and stampsparse,resolveRealms, andstringifydirectly;searchRealms/searchCards/loadLinksstampsql,loadLinks,populate(the relationship/query-field assembly), and the per-instancecacheRead/cacheWrite+ hit/miss counts onto the same collector. The handler emits onerealm:search-timingline:req=… job=… handler=Nms parse=… resolveRealms=… sql=… loadLinks=… populate=… stringify=… | results=… cacheHit=… cacheMiss=…. The correlation id is deliberately kept out of the job-scoped search cache's key material so it can't blow up the cache.Request duration + id in the request log. The
realm:requestsmiddleware now logs the<-- … -->round-tripdur=and thereq=id — the outermost request→response bound (includes body read + send), and the join key tying the timeline line, the request line, and the client diagnostics together.Event-loop health sampler. A periodic
realm:healthline (only during saturation windows) reporting event-loop lag (mean/p99/max) alongside the in-flight_searchcount and heap. This is the signal that confirms or refutes the prime suspect for the 88s — the single-threaded process's event loop being starved (by synchronous post-SQL serialization across many concurrent searches) so requests sit unserviced. The gap between the middlewaredur=total and the handler's stage sum, read against event-loop lag at the time, localizes queue-before-handler time.Everything is gated on the presence of a correlation id, so normal/live traffic allocates no collector and emits no extra lines.
RequestTimingsis threaded through runtime-common but only ever runs in the realm-server (RealmIndexQueryEngineneeds the Postgres adapter the browser doesn't have), so the host is unaffected.Test
A host integration test exercises the whole path end-to-end through the in-realm browser: the SPA mints the correlation id (real
requestIdHeader()), issues a realstore.search, and the test asserts the id lands in the server'srealm:search-timingline (with thesql/loadLinksstage breakdown), via the realm-server mock handing the header to the realsearchRealms. A second case asserts that non-prerender traffic stamps no header and emits no line.Scope
Measurement only — this localizes where the latency is so we can fix it; it does not change the search behavior or fix the latency itself.
Linear: CS-11363